A Bayesian Perspective on Generalization and Stochastic Gradient Descent
Authors
Abstract
We consider two questions at the heart of machine learning: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work responds to Zhang et al. (2016), who showed deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. We show that the same phenomenon occurs in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. We also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy. We propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, we identify the "noise scale" g = ε(N/B − 1) ≈ εN/B, where ε is the learning rate, N the training set size and B the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, Bopt ∝ εN. We verify these predictions empirically.
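To make the scaling rule concrete, here is a minimal Python sketch (not the authors' code; all numeric values are illustrative assumptions) that evaluates the noise scale g = ε(N/B − 1) ≈ εN/B and checks that increasing the learning rate and the batch size together leaves g roughly unchanged, which is the reasoning behind Bopt ∝ εN:

```python
# Minimal sketch (not the authors' code): evaluate the SGD "noise scale"
# g = eps * (N / B - 1) ~= eps * N / B from the abstract, and check that scaling
# the batch size linearly with the learning rate keeps g roughly constant.
# All numeric values are illustrative assumptions.

def noise_scale(eps, N, B):
    """Return the exact and approximate (valid for B << N) SGD noise scale."""
    return eps * (N / B - 1), eps * N / B

N, eps, B = 50_000, 0.1, 128           # assumed training set size, learning rate, batch size

print(noise_scale(eps, N, B))          # baseline noise scale
print(noise_scale(2 * eps, N, 2 * B))  # doubling eps and B together leaves g nearly unchanged
```

Holding the noise scale fixed in this way is what motivates growing the optimum batch size in proportion to both the learning rate and the training set size.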
Similar resources
A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent
We analyze the generalization error of randomized learning algorithms, focusing on stochastic gradient descent (SGD), using a novel combination of PAC-Bayes and algorithmic stability. Importantly, our risk bounds hold for all posterior distributions on the algorithm's random hyperparameters, including distributions that depend on the training data. This inspires an adaptive sampling...
Stochastic Gradient Descent as Approximate Bayesian Inference
Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distributi...
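As a rough illustration of the idea sketched in this abstract, the following Python toy example (a 1-D mean-estimation problem with illustrative hyperparameters, not the authors' setup) runs constant-learning-rate SGD and treats the post-burn-in iterates as draws from the chain's stationary distribution:

```python
# Toy sketch of constant-step SGD as an approximate posterior sampler
# (1-D Gaussian mean estimation; hyperparameters are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
N, B, lr, steps = 1_000, 32, 0.05, 5_000
data = rng.normal(loc=2.0, scale=1.0, size=N)     # observations with unknown mean

theta, samples = 0.0, []
for t in range(steps):
    batch = rng.choice(data, size=B, replace=False)
    grad = theta - batch.mean()                   # gradient of the average squared error
    theta -= lr * grad                            # constant-learning-rate SGD update
    if t > steps // 2:                            # drop burn-in, keep stationary iterates
        samples.append(theta)

# The iterates fluctuate around the minimum; their spread acts as a rough
# stand-in for posterior uncertainty, the quantity the tuning rules aim to match.
print("mean of iterates  :", np.mean(samples))
print("spread of iterates:", np.std(samples))
```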
Elastic Distributed Bayesian Collaborative Filtering
In this paper, we consider learning a Bayesian collaborative filtering model on a shared cluster of commodity machines. Two main challenges arise: (1) How can we parallelize and distribute Bayesian collaborative filtering? (2) How can our distributed inference system handle elasticity events common in a shared, resource managed cluster, including resource ramp-up, preemption, and stragglers? To...
Deep Learning: A Bayesian Perspective
Deep learning is a form of machine learning for nonlinear high dimensional pattern matching and prediction. By taking a Bayesian probabilistic perspective, we provide a number of insights into more efficient algorithms for optimisation and hyper-parameter tuning. Traditional high-dimensional data reduction techniques, such as principal component analysis (PCA), partial least squares (PLS), redu...
Identification of Multiple Input-multiple Output Non-linear System Cement Rotary Kiln using Stochastic Gradient-based Rough-neural Network
Because of the existing interactions among the variables of a multiple input-multiple output (MIMO) nonlinear system, its identification is a difficult task, particularly in the presence of uncertainties. Cement rotary kiln (CRK) is a MIMO nonlinear system in the cement factory with a complicated mechanism and uncertain disturbances. The identification of CRK is very important for different pur...
Journal: CoRR
Volume: abs/1710.06451
Pages: -
Publication date: 2017